Abstract: Cardinality estimation is the task of determining the
number of distinct elements (the cardinality) in a data stream,
under the stringent constraint that the input data stream can be
scanned in only a single pass. This is a fundamental problem
with many practical applications, such as traffic monitoring
of high-speed networks and query optimization of Internet-scale
databases. To solve the problem, we propose an algorithm
named HLL-TailCut+, which achieves a standard estimation
error of $1.0/\sqrt{m}$ using $m$ memory units of three bits each,
which is much smaller than the five-bit memory units
used by HyperLogLog, the best previously known cardinality
estimator. This makes it possible to reduce the memory cost
of HyperLogLog by 45%. For example, when the target
estimation error is 1.1%, state-of-the-art HyperLogLog needs 5.6
kilobytes of memory. By contrast, our new algorithm needs only 3
kilobytes of memory to attain the same accuracy.
Additionally, our algorithm is able to support the estimation of
very large stream cardinalities, even on the Tera and Peta scale.

I. Introduction 

Cardinality estimation is the task of determining the number
of distinct elements in a data stream, which is presented as a
sequence of elements and can be examined in only one pass.
This problem has attracted significant attention over the past
decades, due to its important role in many application domains,
e.g., real-time traffic monitoring in high-speed networks
[4], [10]-[13], [20] or in software-defined networks [21],
query plan optimization in large-scale databases [9], in-network
query aggregation in wireless sensor networks [17], and file
significance evaluation in P2P systems [18].

Practical Importance. In the domain of online traffic
monitoring of high-speed networks, cardinality estimation
can be used to detect traffic anomalies, such as
network IP/port scans and distributed denial-of-service (DDoS)
attacks [10], [11], [20]. For instance, if we treat all the packets
originating from the same source IP as a data stream, then we
can detect whether this source IP is a network scanner by
counting the number of distinct destination IP/port addresses
in its outward packet stream. A similar estimator can be used
to detect whether a server is under DDoS attack, if we treat
all the packets towards a common destination IP as a data
stream and estimate the number of distinct source addresses
in this stream. As other application examples, a server farm
may learn the popularity of its hosted contents by tracking
the number of distinct users that request each file, and an
institutional gateway may perform cardinality estimation on
outbound URL requests to measure the popularity of external
web content for caching priority.

According to a recent paper [9], many data analysis 
systems developed by Google, including Sawzall, Dremel and 
PowerDrill, need to estimate the cardinalities of very large data 
sets (e.g., the number of distinct search queries on Google) on
a daily basis. As pointed out in [9], cardinality estimation over 
large data sets presents a challenge in terms of computational 
resources, and memory in particular; for PowerDrill, a 
non-negligible fraction of queries historically could not be 
computed since they exceeded the available memory. 

Prior Art. Although the cardinality can be easily computed
using space linear in the cardinality, for many applications
this is impractical as it requires too much memory. Therefore, a
large number of algorithms have been developed to produce an
approximate estimation of the cardinality based on a summary
or "sketch" of the data stream, whose occupied space in
memory is merely sublinear. Typical sketch-based
algorithms include PCSA [8], MultiresolutionBitmap [6] (a
generalization of LinearCounting [19]), MinCount [2], [3],
LogLog [5], and HyperLogLog [7], to list just a few.

We make a quick comparison of existing cardinality
estimators in Table I. In the third column, each register may be
a partial machine word of a few bits (that is, a machine word
may hold multiple registers), independently producing
a coarse estimation of the cardinality. To mitigate the high variance
of a single register and improve the estimation accuracy, a
number $m$ of registers must be used. The second column
presents the relationship between the standard error and the
value of $m$, where $m$ refers to the number of registers (or the
number of bits for MultiresolutionBitmap, or the number of
memory units used by MinCount). The total memory cost of
an estimator is $m$ multiplied by the size of a register (or 1 bit
for MultiresolutionBitmap, or 32 bits for MinCount).

In the last column, we list the memory needed by each
algorithm to keep the standard error around 2% of the actual
cardinality, which shows the progress in memory saving over
the past decades: If we use PCSA as the initial benchmark, the
seminal work of LogLog reduces the memory cost by more
than half. The follow-up HyperLogLog (HLL) further cuts the
memory cost by over 30%. Therefore, HLL is the state-of-the-art
algorithm and has been widely adopted in the IT industry,
for example by Google [9], Ask.com [16], PostgreSQL, file-sharing
P2P systems [18], and network security systems for DDoS
and scan detection [7], to list just a few.

TABLE I
A COMPARISON OF POPULAR CARDINALITY ESTIMATORS.

Algorithm        Std. Err. (σ)              Mem Units          Mem (σ = 2%)
MinCount         $1.00/\sqrt{m}$            32-bit keys        10000 bytes
PCSA             $0.78/\sqrt{m}$            32-bit registers   6084 bytes
MultiresBitmap   $\approx 4.4/\sqrt{m}$     1 bit              6050 bytes
LogLog           $1.30/\sqrt{m}$            5-bit registers    2641 bytes
HyperLogLog      $1.04/\sqrt{m}$            5-bit registers    1690 bytes
HLL-TailCut      $1.04/\sqrt{m}$            4-bit registers    1352 bytes
HLL-TailCut+     $1.00/\sqrt{m}$            3-bit registers    938 bytes
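The last column of Table I follows from the other two: solving $\sigma = c/\sqrt{m}$ for $m$ and multiplying by the size of a memory unit gives the memory cost. The short Python sketch below redoes this arithmetic; the dictionary `TABLE_I` and the helper names are illustrative, and the constants are copied from the table.

```python
# Constants from Table I: c in sigma = c / sqrt(m), and bits per memory unit.
TABLE_I = {
    "MinCount": (1.00, 32),
    "PCSA": (0.78, 32),
    "MultiresBitmap": (4.4, 1),
    "LogLog": (1.30, 5),
    "HyperLogLog": (1.04, 5),
    "HLL-TailCut": (1.04, 4),
    "HLL-TailCut+": (1.00, 3),
}


def units_needed(c, sigma):
    """Number of memory units m for which c / sqrt(m) equals sigma."""
    return (c / sigma) ** 2


def memory_bytes(c, unit_bits, sigma):
    """Total memory in bytes: m units of unit_bits bits each."""
    return units_needed(c, sigma) * unit_bits / 8.0


if __name__ == "__main__":
    for name, (c, bits) in TABLE_I.items():
        print(f"{name:15s} {memory_bytes(c, bits, 0.02):9.1f} bytes")
```

The same helper, evaluated at a target error of 1.1%, gives roughly 5.6 kilobytes for HyperLogLog and roughly 3 kilobytes for HLL-TailCut+. The ratio between the two rows is independent of the target error.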

It may appear that the cardinality estimators in Table I
already have small memory overhead (on the scale of KBs),
while providing good estimation accuracy of about
2% error, so that further reducing their memory cost is not
a critically important issue. However, many applications
need a large number of counters to work simultaneously.
Take network traffic anomaly detection as an example. A core
router often receives millions of traffic flows in just a few
minutes. In order to monitor the behavior of every flow, it has
to allocate a cardinality estimator for each flow [10], [11],
[20]. For Google's applications, the number of counters that
work simultaneously is much larger, exceeding one
billion in extreme cases [9]. Hence, the total memory
overhead, which is the per-counter memory cost multiplied by
the number of counters, can easily
overwhelm the memory available on the devices that maintain
these counters. For example, on a high-speed router, the on-chip
SRAM available for online anomaly detection is merely
on the scale of MBs [10], [11], [20], and on Google servers,
the DRAM available for tracking keyword popularity is also
limited, typically on the scale of GBs for a commodity
server [9]. In summary, reducing the memory cost of a
single counter is an important problem with practical value.

Our Contribution. This paper presents a new cardinality
estimator named HLL-TailCut+. As shown in the bottom
row of Table I, compared with the state-of-the-art
HyperLogLog, our algorithm reduces the memory
consumption of a single counter by a further 45%. A key
contribution is that we reduce the size of each register from
5 bits to 3 bits without degrading the accuracy of cardinality
estimation, which represents a degree of compactness that
has not been achieved before. Our technique, called long tail
cutoff, compresses the information across all registers and
meanwhile reduces the variance among the registers, which
in turn reduces the standard error in cardinality estimation.
Consequently, we not only have smaller registers,
but also use fewer registers to attain the same accuracy
compared with the previous algorithms [5], [7], [8]. Moreover,
unlike HyperLogLog, whose operating range is limited to
$10^9$, our algorithm can support the counting of data streams
at Tera or Peta scale. It has no estimation bias over the entire
measurement range, even when handling small cardinalities.

II. Related Work 

The cardinality estimation problem is to count the number of
distinct elements in a stream, wherein each element is allowed
to appear more than once. A key challenge is that the stream of
elements can be scanned in just one pass to obtain the result,
due to the constraint of limited processing time or memory.

Linear-Space Solutions. A naive solution for this problem
is to use a hash table to memorize all the elements seen
so far, in order to filter out the duplicates. This solution
has the advantage of knowing the exact cardinality, but it
needs memory linear in the stream cardinality, which in most
applications is far too large to be kept in available memory.

A well-known algorithm that can approximate the stream
cardinality is LinearCounting (LC) [19]. It distributes all the
stream elements uniformly among a bit array, so that each
element can be encoded as the index of a bit in the array.
Duplicated stream elements will be mapped to the same bit
index, and hence are filtered automatically. LC can provide
the best accuracy among all the known cardinality estimators,
but only under the strict condition that there is sufficient memory
space, roughly linear in the cardinality [16]. Otherwise, its
accuracy will degrade severely. Since our interest is to estimate
very large cardinality values on Giga or Tera scale, LC is no
longer attractive, as it requires too much memory.

Sublinear-Space Solutions. Researchers have developed a
whole range of algorithms that require only sublinear
memory space [2], [3], [5]-[8]. A frequently used method
for reducing memory cost is sampling. An example is
MultiresolutionBitmap [6], which designs a sequence of LC
structures whose sampling probabilities decrease exponentially.
Another example is MinCount [2], which records only the
k smallest hash values for a stream of data items. For both
algorithms, the memory efficiency is worse than that of LogLog and
HyperLogLog, as reported by a comparison study [1].

PCSA (Probabilistic Counting with Stochastic Averaging)
also prepares a sequence of sampled subsets, but it reduces
their sampling probability exponentially, until the probability
becomes so small that a sampled subset has no data [8].
For ease of understanding, the sequence of sampled
subsets is depicted in Fig. 1 as a sequence of buckets, whose
probability of receiving stream elements decreases by the series
$2^{-1}, 2^{-2}, 2^{-3}, \ldots, 2^{-w}$. To record whether each bucket has
received any stream elements, PCSA allocates a bit array in
memory: If a bucket receives nothing and remains empty, its
corresponding bit is zero; otherwise, the bit is one. The ×
mark in Fig. 1 represents that a bit is either zero or one.

By maintaining the state of this bit array upon stream
element arrivals, PCSA always knows the index of the leftmost
empty bucket, which is denoted in Fig. 1 by the symbol
$M'$. Such a bit array is called a register, which can give an
independent estimation of stream cardinality as $2^{M'}$. Hence,
if a PCSA register is given $w$ bits of memory, the range
of its estimated cardinality is as large as $2^w$, which is a
key advantage of PCSA. Of course, the estimation by a
single register will be highly inaccurate. To improve the
accuracy, PCSA uses a technique called stochastic averaging
that allocates multiple registers to produce independent
estimations, and returns the average value of their estimations.

The memory efficiency of PCSA still leaves much room
for improvement: Its register size must be $\log_2 n_{\max} + O(1)$
bits, where $n_{\max}$ is the upper bound of the measured cardinality.
In contrast, a follow-up algorithm called LogLog reduces the
memory per register to only $\log_2 \log_2 n_{\max} + O(1)$ bits [5].
Such significant memory compression is possible because, instead of
maintaining the state of the entire bit array like PCSA, LogLog
records only the index of the rightmost non-empty bucket,
which is denoted by the symbol $M$ in Fig. 1.

HyperLogLog (HLL) is a variant of LogLog with improved
accuracy [7]. Both of them depend on the observation
of position $M$ shown in Fig. 1, but they adopt different
methods for aggregating the estimation results of a set of
registers. LogLog uses the geometric mean, while HLL uses the
harmonic mean, whose purpose is to mitigate the impact of
outlier registers with abnormally large estimations, thereby
appreciably increasing the quality of estimations. As shown
in Table I, the expected error of HLL is $1.04/\sqrt{m}$, which
is much smaller than that of LogLog, $1.30/\sqrt{m}$. In short,
HyperLogLog is the state-of-the-art algorithm.

After years of development, it appears to be very difficult to
further compress the memory cost of a cardinality estimator.
However, our HLL-TailCut+ estimator can reduce the memory cost
of HLL by a further 45%, based on a long tail cutting technique
to be proposed in this paper. Our estimator reduces the
size of a register to three bits, which is much smaller than
the five-bit register used by HLL, and meanwhile it provides
the expected relative error of $1.0/\sqrt{m}$. Therefore, our
HLL-TailCut+ algorithm can both reduce the per-register memory
cost and discard large outliers to improve accuracy.

III. Traditional HyperLogLog 

In this section, we introduce the traditional HyperLogLog
(HLL) algorithm in detail, and then identify its inadequacies,
which motivate the design of our own algorithms.

A. Basic Idea of HyperLogLog 

For ease of understanding, we first explain the
estimation procedure of a single HLL register. As shown in
Fig. 1, when this register receives a stream of elements, it
distributes these elements exponentially among a sequence of
buckets, i.e., the probability for the buckets to receive elements
decreases exponentially by the series $2^{-1}, 2^{-2}, 2^{-3}, \ldots$

To implement this exponential distribution, a hash
function $h$ is applied to each stream element $e$. Let us focus on
the binary representation of a hash value $h(e)$. The probability
of observing the bit pattern $0^{\rho-1}1$ at its beginning is $1/2^{\rho}$,
where $\rho$ is one plus the number of leading zeros. For instance,
if the hash value $h(e)$ has no leading zeros, then $\rho(1\ldots) = 1$,
and the probability of observing the bit pattern is $1/2^1$. If
there are three leading zeros, then $\rho(0001\ldots) = 4$, and the
chance of observing the bit pattern is $1/2^4$. Therefore, we can
simply regard the symbol $\rho$ as the index of the bucket a stream
element $e$ has been mapped to.

Fig. 1. Observation used by PCSA and HyperLogLog. (Rows of the
figure: probability of receiving elements, bucket index, bucket
occupancy. Markers: $M'$ for PCSA, $M$ for HLL.)
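The bucket index is easy to compute on a fixed-width hash word. The Python sketch below is only an illustration: the function name `rho`, the word width, and the use of `random.getrandbits` as a stand-in for a uniform hash function are all assumptions made for the example. It also measures how often each bucket index occurs, which should be close to $1/2^k$ for index $k$.

```python
import random


def rho(w, width):
    """One plus the number of leading zeros of w, viewed as a width-bit word."""
    return width - w.bit_length() + 1


def rho_frequencies(samples, width=32, seed=0):
    """Empirical frequency of each bucket index over uniformly random words."""
    rng = random.Random(seed)  # stand-in for hashing distinct stream elements
    counts = {}
    for _ in range(samples):
        k = rho(rng.getrandbits(width), width)
        counts[k] = counts.get(k, 0) + 1
    return {k: c / samples for k, c in sorted(counts.items())}


if __name__ == "__main__":
    print(rho(0b10000000, 8), rho(0b00010000, 8))  # no leading zeros; three leading zeros
    for k, f in rho_frequencies(100000).items():
        print(k, round(f, 4), 2.0 ** -k)
```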

An HLL register records the largest $\rho$ value over all its
input elements; that is, the register records the position
of the rightmost non-empty bucket, which is denoted by
$M$ in Fig. 1. Because the probability for this bucket to
receive elements is $1/2^M$, intuitively, a good estimation for
the number of elements the register receives could be $2^M$.

However, the cardinality estimation $2^M$ by a single register
has high variance. To mitigate the high variance, a technique
called stochastic averaging is adopted: The input data stream
$S$ is pseudorandomly split into $m$ substreams, which are then fed
into $m$ registers. Each register counts the cardinality of its
input substream independently. When needed, their results are
aggregated to estimate the cardinality of the data stream $S$.

B. Detailed Algorithm Procedure 

Suppose we have allocated $m$ registers $M_0, M_1, \ldots, M_{m-1}$.
The procedure of HyperLogLog can be divided into two parts:
an online component that processes each stream element and
records critical information into the set of registers, and an
offline analysis component that recovers the stream cardinality
information from the register set.

Online Component. For an element $e$ in stream $S$, we apply the
hash function $h$ to it, and the resultant hash value is denoted by
$x$. For the binary representation of $x$, let $j$ be its initial $p$ bits,
where $p = \log_2 m$ (i.e., $m = 2^p$), and let $x'$ be its remaining bits:

$x = h(e), \quad j = (x_1 x_2 \cdots x_p)_2, \quad x' = (x_{p+1} x_{p+2} \cdots)_2.$

The integer $j$ decides that the register $M_j$ receives this stream
element. The integer $x'$ is a hash value that updates $M_j$:

$M_j := \max(M_j, \rho(x')),$    (1)

where := is the assignment operator, and $\max(a, b)$ is a
function that returns the greater value of its two parameters.
As stated before, $\rho(x')$ is one plus the number of leading zeros
in the binary format of $x'$, for instance, $\rho(0001\ldots) = 4$. Hence,
when the $j$th substream is nonempty, the register $M_j$ records
the index of the rightmost nonempty bucket as in Fig. 1.

Offline Analysis Component. Each register $M_j$ in the register
set with $0 \le j < m$ can give an estimation $2^{M_j}$ for the
cardinality of its substream. For aggregating the substream
cardinalities, HLL uses the normalized harmonic mean:

$\hat{n} = \alpha_m \cdot m^2 \cdot \left( \sum_{0 \le j < m} 2^{-M_j} \right)^{-1},$    (2)

where $\alpha_m$ is a bias correction constant: $\alpha_{16} = 0.673$, $\alpha_{32} =
0.697$, $\alpha_{64} = 0.709$, and $\alpha_m = 0.7213/(1 + 1.079/m)$ if $m \ge 128$.
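The two components can be put together in a few lines. The following Python sketch is a minimal illustration of the update rule (1) and the estimator (2), not a production implementation: the class name `HyperLogLog`, the helper `hash64`, and the choice of a 64-bit BLAKE2b digest as the hash function $h$ are assumptions made for the example, and registers are stored as plain integers rather than packed 5-bit fields.

```python
import hashlib


def hash64(item):
    """Assumed hash function h: a 64-bit integer derived from BLAKE2b."""
    digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def rho(w, width):
    """One plus the number of leading zeros of w, viewed as a width-bit word."""
    return width - w.bit_length() + 1


def alpha(m):
    """Bias correction constant of Eq. (2)."""
    small = {16: 0.673, 32: 0.697, 64: 0.709}
    return small.get(m, 0.7213 / (1 + 1.079 / m))


class HyperLogLog:
    def __init__(self, p):
        if p < 4:
            raise ValueError("this sketch assumes m = 2**p >= 16")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        """Online component: Eq. (1)."""
        x = hash64(item)
        width = 64 - self.p
        j = x >> width                    # initial p bits select the register
        rest = x & ((1 << width) - 1)     # remaining bits x'
        self.registers[j] = max(self.registers[j], rho(rest, width))

    def estimate(self):
        """Offline component: normalized harmonic mean, Eq. (2)."""
        z = sum(2.0 ** -r for r in self.registers)
        return alpha(self.m) * self.m ** 2 / z


if __name__ == "__main__":
    hll = HyperLogLog(p=10)
    for i in range(100000):
        hll.add(i)
    print(round(hll.estimate()))
```

Because a register only ever keeps a maximum, feeding the same element again leaves the sketch unchanged, which is what makes the structure insensitive to duplicates.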

C. Shortcomings of HyperLogLog 

HyperLogLog is an excellent algorithm that provides the
relative standard error $1.04/\sqrt{m}$ at the cost of $5m$ bits of memory.
Its high accuracy and memory compactness have triggered
extensive adoption in the IT industry, e.g., by Google [9], Ask.com
[16] and PostgreSQL. However, this algorithm still has
two inadequacies, which open the door to further improvements.

Threat of Outliers. As mentioned before, the observation used
by HyperLogLog, which is the value of each register $M_j$,
has high variance. To give an impression of the high variance, we
illustrate in Fig. 2 the probability distribution for a register
to carry an arbitrary value $k$. Plot (a) is drawn in normal
scale, and plot (b) uses a log scale for the Y-axis. The
mathematical formula of this probability distribution will
be described later in Eq. (3). Here, the two plots show that
it is a right-skewed distribution with a long tail stretching
out to the right side of the peak. Note that this property of the
$M_j$ distribution has no relation to the input stream data. It
originates from the uniform distribution of the hash function $h$.

The registers whose value strongly deviates from the peak 
are called outliers, which are most likely to exist on the right- 
side long tail of the distribution as illustrated in Fig. 2(b). 
In order to mitigate the impact of the outliers existing on 
the right tail that have abnormally large register values, HLL 
adopts harmonic average to aggregate the estimation results 
of a register set. Our intuition is to completely remove the 
impact of large outliers, by cutting off the right-side long tail 
on such a histogram, which contains plenty of outliers instead 
of useful information. It may appear that the outlier rejection 
can be easily implemented by discarding the registers whose 
values are much larger than the average. But the difficulty is 
how to achieve the tail cutoff when the size of each register 
is reduced to less than five bits for space saving. 



Fig. 2. Probability of register values, when $m = 512$ and load factor $n/m = 100$.

Inefficient Register Encoding. The second inadequacy of
HyperLogLog is its inefficient encoding of each register
state. An HLL register is five bits long, so that
the cardinality estimation by a single register can be up to
$2^{2^5} \approx 4 \times 10^9$. A recent paper proposed to expand the register
size to six bits, in order to support the counting of big data on
Tera or Peta scale [9]. On the contrary, we discover that it is
possible to reduce the register size to four bits or even to three
bits, and meanwhile support the same large operating range.

Our inspiration comes from Fig. 2(b), where only
the sixteen highest bars, between 4 and 19, have probabilities
greater than 0.01%. It implies that, when the number of
registers $m$ is on the scale of thousands, the spread of a register
set (i.e., the largest register value minus the smallest register
value) is less than sixteen in most cases. From the perspective
of information theory, it is redundant to use five bits to encode
each register, and four bits may be sufficient in most cases.

Moreover, in Fig. 2(a), only the eight highest bars,
between 5 and 12, have probabilities greater than 2%.
These eight bars are the most informative part of the histogram,
and the others are more prone to contain outliers, which implies
the possibility of abandoning the right tail for outlier rejection
and encoding each register with only three bits of memory.
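The two ranges quoted above can be checked numerically. The Python sketch below assumes the standard Poisson approximation for a HyperLogLog register, under which a register that receives on average $n/m$ distinct elements satisfies $\Pr\{M_j \le k\} = e^{-(n/m)/2^k}$. This closed form and the function name `register_pmf` are assumptions made for the illustration rather than quotations from this paper. With the load factor of Fig. 2 ($n/m = 100$), it lists the register values whose probability exceeds 2% and 0.01%.

```python
import math


def register_pmf(k, load):
    """Pr{M_j = k} under the assumed Poisson model, where load = n / m."""
    if k == 0:
        return math.exp(-load)  # the register received no element at all
    return math.exp(-load / 2.0 ** k) - math.exp(-load / 2.0 ** (k - 1))


def values_above(threshold, load, max_k=64):
    """Register values whose probability exceeds the given threshold."""
    return [k for k in range(max_k) if register_pmf(k, load) > threshold]


if __name__ == "__main__":
    load = 100.0
    print("above 2%:   ", values_above(0.02, load))
    print("above 0.01%:", values_above(1e-4, load))
```

Under this model the values above 2% are exactly 5 through 12 (eight values) and the values above 0.01% are exactly 4 through 19 (sixteen values), in agreement with the bars described for Fig. 2.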
Conclusion. Our basic idea is that the memory per register 
may be compressed to four bits or even to three bits with 
no significant loss of useful information. Due to the smaller 
register size, we can allocate a larger number of registers from 
the same memory budget, which drives down the estimation 
error. However, what we have proposed is lossy compression 
of registers, and the challenge is how to avoid its side effect. 

IV. MLE-based HyperLogLog 

Before introducing our algorithm, we replace the estimation 
equation of HyperLogLog in (2) with an alternative formula in 
(5), which is based on MLE (maximum likelihood estimation). 
The analysis in this section is the theoretical foundation of 
our own algorithm. Moreover, HyperLogLog has another
inadequacy: it is strongly biased when handling small
cardinalities (see Fig. 6(a) in the simulation section). Our
MLE-based substitute can solve this problem and provide
unbiased estimations over the entire measurement range.


A. Maximum Likelihood Estimator 

In this subsection, we will present a maximum likelihood 
estimator for the number of distinct elements in a data stream. 

We analyze the probability distribution for an HLL register
to carry an arbitrary value $k$, which has been illustrated in
Fig. 2, and we have the following theorem.

Theorem 1 (Probability of Register Value). The probability
for a register (for instance, the $j$th register $M_j$) to take
a particular value $k$ is approximately

$\Pr\{M_j = k\} \approx \begin{cases} \cdots & \text{if } k = 0, \\ \cdots & \text{if } k \ge 1. \end{cases}$    (3)

Proof. See Appendix A of the extended version [14]. □
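To make the role of Theorem 1 concrete, the Python sketch below shows how a register-value probability can drive a numerical maximum-likelihood search for $n$, which is the idea developed in the remainder of this section. Several parts are assumptions made for the illustration: the probability is taken to be the standard Poisson form $\Pr\{M_j = k\} = e^{-a}(1 - e^{-a})$ with $a = n/(m 2^k)$ for $k \ge 1$ and $e^{-n/m}$ for $k = 0$; the registers are sampled directly from that model instead of from a hashed stream; and the maximization uses a ternary search over $\log_2 n$ (valid because this log-likelihood is concave in $n$) in place of the gradient ascent procedure described later in this section.

```python
import math
import random


def log_pmf(k, n, m):
    """log Pr{M_j = k} under the assumed Poisson model with load n / m."""
    if k == 0:
        return -n / m
    a = n / (m * 2.0 ** k)
    # exp(-a) - exp(-2a) = exp(-a) * (1 - exp(-a)), evaluated stably
    return -a + math.log(-math.expm1(-a))


def log_likelihood(n, counts, m):
    """Sum of N_k * log Pr{M_j = k} over the observed register values."""
    return sum(c * log_pmf(k, n, m) for k, c in counts.items())


def mle_cardinality(registers, iterations=200):
    """Maximize the log-likelihood by ternary search over t = log2(n)."""
    m = len(registers)
    counts = {}
    for r in registers:
        counts[r] = counts.get(r, 0) + 1
    lo, hi = 0.0, 64.0
    for _ in range(iterations):
        t1 = lo + (hi - lo) / 3
        t2 = hi - (hi - lo) / 3
        if log_likelihood(2.0 ** t1, counts, m) < log_likelihood(2.0 ** t2, counts, m):
            lo = t1
        else:
            hi = t2
    return 2.0 ** ((lo + hi) / 2)


def sample_registers(n, m, seed=0):
    """Draw m register values from the assumed model by inverting its CDF."""
    rng = random.Random(seed)
    load = n / m
    registers = []
    for _ in range(m):
        u = rng.random()
        k = 0
        while math.exp(-load / 2.0 ** k) < u:
            k += 1
        registers.append(k)
    return registers


if __name__ == "__main__":
    regs = sample_registers(1000000, 1024, seed=1)
    print(round(mle_cardinality(regs)))
```

The same search works without modification when most registers are still zero, since the $k = 0$ term of the likelihood then carries the information that a small-range correction would otherwise have to supply.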


For an arbitrary non-negative value $k$, let $N_k$ be the
number of registers, among the $m$ registers, which carry the
value $k$. If exactly $N_k$ registers are observed carrying a particular
value $k$, the probability of this observation is $\Pr\{M_j = k\}^{N_k}$,
assuming these registers are mutually independent.
Then, the combined probability of making the observations
$N_0, N_1, \ldots, N_\infty$ for all the $k$ values from zero to infinity is
as follows, under the condition that the stream cardinality is $n$.

$\Pr\{N_0, N_1, \ldots, N_\infty \mid n\} = \prod_{k=0}^{\infty} \Pr\{M_j = k\}^{N_k}$

However, it is impossible to measure the number of registers
$N_k$ for an arbitrarily large $k$ value up to infinity, because each
register is given limited memory space (typically 5 bits). We
use a symbol $\mathcal{K}$ to characterize a register's upper-bound capacity
of recording the $k$ value. For example, if each register is given 5
bits, then it can record a limited range of $k$ values starting from
0 up to $2^5 - 1 = 31$, and in this case, $\mathcal{K} = 32$. If each register
is of 4 bits, then the threshold $\mathcal{K} = 2^4 = 16$. Considering
this upper limit of recording $k$ values, the above probability
function needs to be modified, assuming only the availability
of the observations $N_0, N_1, \ldots, N_{\mathcal{K}-1}$.

$\Pr\{N_0, N_1, \ldots, N_{\mathcal{K}-1} \mid n\} = \prod_{k=0}^{\mathcal{K}-1} \Pr\{M_j = k\}^{N_k}$

This probability function is also called the likelihood of the
unknown parameter $n$, when given the observations about the
number of registers carrying each value: $N_0, N_1, \ldots, N_{\mathcal{K}-1}$.

$\mathcal{L}(n \mid N_0, N_1, \ldots, N_{\mathcal{K}-1}) = \prod_{k=0}^{\mathcal{K}-1} \Pr\{M_j = k\}^{N_k}$    (4)

Applying the well-known maximum likelihood estimation,
we can find the best $n$ value that maximizes the log-likelihood
function, and we use the symbol $\hat{n}$ to denote this optimized
estimation of the stream cardinality $n$.

$\hat{n} = \arg\max_n \log \mathcal{L}(n \mid N_0, N_1, \ldots, N_{\mathcal{K}-1})$    (5)

B. Gradient Ascent Solution for MLE

In this subsection, we present our solution to the MLE
optimization problem in (5). Although it is viable to solve this
problem symbolically by finding the closed-form root of the
equation $\frac{d}{dn} \log \mathcal{L}(n \mid N_0, N_1, \ldots, N_{\mathcal{K}-1}) = 0$, this solution
will be complex and have low flexibility (we will demonstrate
this point in the next section, when the symbol $\mathcal{K}$ is configured
to some other value smaller than $2^5$). Therefore, we choose to
solve this optimization problem numerically.

We use the following iterative optimization procedure to
obtain an optimized estimation of the stream cardinality $n$:

$n^{(i+1)} = n^{(i)} + \eta \cdot \frac{d}{dn} \log \mathcal{L}(n^{(i)} \mid N_0, N_1, \ldots, N_{\mathcal{K}-1}),$    (6)

where $\frac{d}{dn} \log \mathcal{L}(n \mid N_0, N_1, \ldots, N_{\mathcal{K}-1})$ is the gradient of the
log-likelihood function, whose mathematical expression is given in
Appendix B, $n^{(i)}$ is the current estimation of $n$, $n^{(i+1)}$ is the
next-round estimation, $\eta$ is the optimization step size assigned
to $2^B m$, and $B$ is the smallest value among all registers.

For the above iterative optimization method, the computational
cost is only to evaluate the gradient of the log-likelihood
function for ten or twenty rounds. Moreover, we can speed up
its convergence if we generate a good initial guess using the
closed-form cardinality estimator of HyperLogLog in (2).

V. HLL-TailCut Algorithm

In this section, we reduce the size of each HyperLogLog
register from five bits to four bits, and we call the new
algorithm HLL-TailCut, because it essentially applies the long
tail cutoff technique to the histogram in Fig. 2. HLL-TailCut
reduces the memory cost of HLL by 20%. Unlike HLL, it can
support the counting of Tera- or Peta-scale data streams.

A. Base Register and Offset Registers

Our basic idea is to use a shared base register for storing
the smallest value among the set of HyperLogLog registers, so
that the $m$ registers only need to store their offsets relative to
the base register. Intuitively, the offset stays in a much smaller
range than $2^5$ (see Fig. 2) and can be encoded by fewer than five
bits. In the following, we explain how to maintain the base register
and the $m$ offset registers upon the arrival of stream elements.

Let $B$ be the shared base register that records the smallest
value of the HLL register set. Due to the base register, Eq. (1)
that updates each offset register $\tilde{M}_j$ should be changed to

$\tilde{M}_j := \max(\tilde{M}_j, \rho(x') - B),$    (7)

where $\tilde{M}_j$ has an upper tilde indicating it is the $j$th offset
register that records the offset of $M_j$ relative to the base register
$B$, $\rho(x')$ is the index of the bucket a stream element is mapped
to, and $\rho(x') - B$ is the offset of the bucket index relative to $B$.


Handle Overflow of Offset Register. We define each offset
register to be four bits long. Thus, an offset register's recording
capacity $\mathcal{K}$ is $2^4 = 16$, implying that the recorded offset value
must be smaller than $\mathcal{K} = 16$. However, occasionally, the
offset values $\rho(x') - B$ of some stream elements are at least
$\mathcal{K}$. We use the term "overflow" to refer to the attempts of
updating the offset register $\tilde{M}_j$ to the value $\mathcal{K}$ or above.

In order to handle the overflow event, we scan the $m$ offset
registers to find the smallest offset value, which is denoted as
$\Delta B$. If $\Delta B$ is non-zero, it implies that the shared base register
$B$ can be increased by this amount to reduce the offset value
stored in each offset register. We call this operation "update
the base register": Whenever $B$ is increased by $\Delta B$, each offset
register $\tilde{M}_j$ needs to be decreased by $\Delta B$, as they record the
offsets to $B$. Thanks to this base register updating operation,
we can easily count data streams on Tera or Peta scale.

After the increase of the base register by $\Delta B$, the new
offset value $\rho(x') - B$ in (7) may become smaller than $\mathcal{K}$.
If that is true, then the overflow event disappears. Otherwise,
the overflow problem cannot be resolved, and the $j$th offset
register has to be truncated by the cutoff bound $\mathcal{K}$ as follows.

$\tilde{M}_j := \max(\tilde{M}_j, \min(\rho(x') - B, \mathcal{K} - 1))$    (8)

B. Pseudocode of HLL-TailCut 

In this subsection, we describe the procedure of the
HLL-TailCut algorithm (also abbreviated as HLL-TC), which can
be divided into two parts: an online component that updates
the base register $B$ and offset registers $\tilde{M}_j$, $0 \le j < m$, upon
the arrival of stream elements, and an offline component that
estimates the stream cardinality $n$ using these registers.

We present the pseudocode of the online component in
Algorithm 1. We use the term "truncated register" to refer
to the sum of the base register $B$ and an offset register $\tilde{M}_j$.
Algorithm 1 essentially maintains a set of truncated registers
$B + \tilde{M}_j$, $0 \le j < m$. Different from the HLL register $M_j$
maintained by (1), these truncated registers chop off the long
tail of a histogram (like in Fig. 2) at the bound $B + \mathcal{K}$, and the
chopped part is stacked above the $(B + \mathcal{K} - 1)$th bar, due to
line 9. Thus, the resultant histogram will exhibit an edge peak
distribution with a spike close to the tail truncation point.


Algorithm 1: Online Component of HLL-TailCut

1  initialize $B$ and $\tilde{M}_j$ to zero, for each $j \in [0, m)$
2  foreach element $e$ in data stream $S$ do
3      $x := h(e)$, $j := (x_1 x_2 \cdots x_b)_2$, $x' := (x_{b+1} x_{b+2} \cdots)_2$
4      if $\rho(x') - B \ge \mathcal{K}$ then                // detect overflow
5          $\Delta B := \min_{0 \le i < m} \tilde{M}_i$
6          if $\Delta B > 0$ then
7              $B := B + \Delta B$                     // update base register
8              foreach $i \in [0, m)$ do $\tilde{M}_i := \tilde{M}_i - \Delta B$
9      $\tilde{M}_j := \max(\tilde{M}_j, \min(\rho(x') - B, \mathcal{K} - 1))$


Since the online component above no longer maintains the
HLL register $M_j$, we need to modify the offline estimation
equation in (2), using the newly designed base register $B$ and
offset registers $\tilde{M}_j$. A straightforward solution is to replace
$M_j$ by the $j$th truncated register $B + \tilde{M}_j$.

$\hat{n} = \alpha_m \cdot m^2 \cdot \left( \sum_{0 \le j < m} 2^{-(B + \tilde{M}_j)} \right)^{-1}$    (9)
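Algorithm 1 and the estimator (9) translate almost line by line into code. The Python sketch below is a minimal illustration under stated assumptions: the class name `HLLTailCut`, the helper `hash64`, and the 64-bit BLAKE2b hash are choices made for the example, and each offset is kept as a plain integer instead of a packed four-bit field.

```python
import hashlib


def hash64(item):
    """Assumed hash function h: a 64-bit integer derived from BLAKE2b."""
    digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def rho(w, width):
    """One plus the number of leading zeros of w, viewed as a width-bit word."""
    return width - w.bit_length() + 1


def alpha(m):
    """Bias correction constant, same values as for HyperLogLog."""
    small = {16: 0.673, 32: 0.697, 64: 0.709}
    return small.get(m, 0.7213 / (1 + 1.079 / m))


class HLLTailCut:
    def __init__(self, b, offset_bits=4):
        self.b = b
        self.m = 1 << b
        self.K = 1 << offset_bits      # cutoff bound: offsets stay below K
        self.base = 0                  # shared base register B
        self.offsets = [0] * self.m    # offset registers

    def add(self, item):
        x = hash64(item)
        width = 64 - self.b
        j = x >> width
        r = rho(x & ((1 << width) - 1), width)
        if r - self.base >= self.K:                  # line 4: detect overflow
            delta = min(self.offsets)                # line 5
            if delta > 0:                            # line 6
                self.base += delta                   # line 7: update base register
                self.offsets = [o - delta for o in self.offsets]  # line 8
        # line 9: update the offset register, truncated at K - 1
        self.offsets[j] = max(self.offsets[j], min(r - self.base, self.K - 1))

    def estimate(self):
        """Eq. (9): harmonic mean over the truncated registers B + offset."""
        z = sum(2.0 ** -(self.base + o) for o in self.offsets)
        return alpha(self.m) * self.m ** 2 / z


if __name__ == "__main__":
    tc = HLLTailCut(b=10)
    for i in range(100000):
        tc.add(i)
    print(tc.base, max(tc.offsets), round(tc.estimate()))
```

The scan for the smallest offset costs $O(m)$, but it runs only when an overflow is detected, so its cost is spread over many cheap updates.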

C. Performance Evaluation of HLL-TailCut

In this subsection, we evaluate how the estimated result by
(9) is affected when we truncate the right-side tail of an HLL
histogram at $B + \mathcal{K}$. In Appendix C of the extended version [14],
we prove that HLL-TC in (9) can generate unbiased estimations
if each offset register $\tilde{M}_j$ is given four bits of memory. In
Appendix C, we also prove that, if multiple HLL-TC
estimators are deployed at different locations, these estimators
can be composed to estimate the union of data streams.

In the following, we use experiments to verify that the tail
cutoff at the boundary $B + 16$ has a negligibly small impact
on the cardinality estimation by (9). We plot the evaluation
results in Fig. 3. Subfigure (a) illustrates the estimation
bias $E(\hat{n} - n)/n$, where $n$ is the actual cardinality and $\hat{n}$
is the estimated value. Subfigure (b) depicts the relative
standard deviation of the estimated results, $\sqrt{\mathrm{Var}(\hat{n})}/E(\hat{n})$. We
illustrate both the results of LinearCounting and HLL-TC,
which are configured with the same number of memory units:
LinearCounting is given $m$ bits, and HLL-TC is given $m$ offset
registers. Plot (a) shows that the tail cutoff does not cause severe
bias to HLL-TC when the cutoff bound is $B + 16$.

Fig. 3. Performance of HLL-TailCut allocated with 512 offset registers, each
of which is given four bits of memory. Panels: (a) estimation bias;
(b) standard deviation, 4.6%. Horizontal axis of both panels: actual
cardinality (×1000).

VI. HLL-TailCut+ Algorithm

In this section, we reduce the size of each offset register to
three bits, and reduce the memory cost by over 40% compared with HLL.

A. Bias Problem of the Naive HLL-TailCut 

When the offset register size is reduced to three bits, we set
the cutoff bound $\mathcal{K}$ to $2^3 = 8$, and we can then reuse the online
component in Algorithm 1 to maintain the base register $B$ and
each offset register $M_j$ upon the arrival of stream elements.
However, the offline analysis component in (9) used by the
naive HLL-TailCut has a serious "estimation bias" problem,
which we identify and explain as follows.

HLL-TC adopts an estimation equation in (9) similar to that of
HyperLogLog. We illustrate the experimental results of this
solution in Fig. 4. Subfigure (a) shows that HLL-TC with
cutoff bound $\mathcal{K} = 8$ produces an estimation bias of $-5.2\%$.
This is because, when the offset register is three bits and $\mathcal{K}$
is reduced to eight, the percentage of registers truncated by (8),
called the overflow probability, increases greatly to about 5%.
Thus, a non-negligible fraction of offset registers are truncated.

To make things worse, in Fig. 4(a), the bias of HLL-TC
grows to $-5.2\%$ along a non-linear curve, implying that we
cannot compensate for such bias simply by applying a constant
corrector to the biased estimation result.



Fig. 4. Performance of HLL-TailCut configured with 512 offset registers,
each of which is given three bits. Panels: (a) estimation bias and
(b) standard deviation (4.4%), both versus actual cardinality (×1000).

Interestingly, the standard deviation of HLL-TC decreases 
from 4.6% shown in Fig. 3(b) to 4.4% shown in Fig. 4(b). 
This is because more outliers in the long tail are discarded, as 
the tail cutoff bound changes from 16 to 8. More aggressive 
outlier rejection brings a small degree of accuracy gain. 

B. Probabilistic Model of Truncated Register 

To address the negative bias problem of HLL-TC, we
propose the HLL-TailCut+ algorithm, which modifies the MLE-based
HyperLogLog algorithm discussed in Section IV. We
have already depicted its performance in Fig. 4. Plot (a) shows
that it provides unbiased estimations over the entire range, and
plot (b) shows that its accuracy is comparable to that of HLL-TC.

Before introducing this new algorithm, in this subsection, 
we analyze the probability of a truncated register B + Mj to 
exhibit an arbitrary k value. The key difficulty for this analysis 
is that, during the execution time of the online component, the 
shared register B is not a fixed value, but gradually increases 
as the register set receives more and more stream elements. 

We begin by defining some notation. Let $b$ be the final
value to which the base register $B$ has been updated. Note
that $b$ is typically a small value even for a very large stream
at the scale of $10^9$. For example, let the number of registers
$m$ be 1024. When the stream size $n$ equals $10^2 m$, the value of $b$
alternates between 4 and 5, as in Fig. 7(a). When $n$ increases
to $10^6 m$ (about $10^9$), $b$ grows only to 17 or 18.

As the base register $B$ increases step by step
over the range $[0, b]$, the register set consisting of $m$ offset
registers receives a different number of stream elements at each step.

• When $B$ is equal to 0, we assume that the register set
receives $n_0$ distinct stream elements.

• When $B$ is equal to 1, the register set receives $n_1$ stream
elements that are distinct from the previous $n_0$ elements.

• . . .

• When $B$ is equal to $b$, the register set receives $n_b$ stream
elements that are distinct from the $n_0, n_1, \ldots, n_{b-1}$
elements received when $B$ was equal to the previous values.

Our goal is to estimate the total cardinality
$n$ of the data stream, which is equal to $n_0 + n_1 + \cdots + n_b$.

Let $M_j^{(0)}, M_j^{(1)}, \ldots, M_j^{(b)}$ be the values of the $j$th offset
register when the base register $B$ is fixed to $0, 1, \ldots, b$ and
the register set independently receives $n_0, n_1, \ldots, n_b$ distinct
elements, respectively. For example, $M_j^{(1)}$ is the value of the $j$th
offset register when the base register $B$ is fixed to 1 and
the register set receives $n_1$ unique elements that are all
different from the $n_0$ elements received while $B$ was still zero.

After the register set receives all the $n = n_0 + n_1 + \cdots + n_b$
stream elements, the $j$th truncated register $B + M_j$ becomes

$$B + M_j = \max\big(B + M_j^{(0)},\; B + M_j^{(1)},\; \ldots,\; B + M_j^{(b)}\big).$$

Because $B + M_j^{(0)}, B + M_j^{(1)}, \ldots, B + M_j^{(b)}$ are independent,
the cumulative probability for $B + M_j$ (i.e., the probability for
the $j$th truncated register to carry a value of at most $k$) is

$$\Pr\{B + M_j \le k \mid n_0, n_1, \ldots, n_b\} = \prod_{0 \le i \le b} \Pr\{B + M_j^{(i)} \le k \mid n_i\}. \tag{10}$$

This requires the cumulative distribution of the $j$th truncated
register, $\Pr\{B + M_j^{(i)} \le k \mid n_i\}$, when the base register is fixed
to a value $i$ ranging from 0 to $b$. If the base register is equal to
$b$, the cumulative distribution $\Pr\{B + M_j^{(b)} \le k \mid n_b\}$
is given in the following theorem. When the base register is
equal to one of the other values $0, 1, \ldots, b - 1$, we can easily obtain
the corresponding cumulative probability by replacing the
symbol $b$ in (11) by $0, 1, \ldots,$ or $b - 1$, respectively.


Theorem 2 (Cumulative Distribution of Truncated Register
$B + M_j^{(b)}$ with Fixed Base Register). When the base register
$B$ is fixed to a value $b$ and the register set receives $n_b$ distinct
elements, the probability for the truncated register $B + M_j^{(b)}$
to exhibit a value of no more than $k$ is as follows.

$$\Pr\{B + M_j^{(b)} \le k \mid n_b\} = \begin{cases} \cdots & \text{if } 0 \le k \le b + \mathcal{K} - 2 \\ 1 & \text{if } k \ge b + \mathcal{K} - 1 \end{cases} \tag{11}$$

Proof. Directly derived from Theorem 1. □


By applying (11) to (10), we can obtain the cumulative
probability of the $j$th truncated register, $\Pr\{B + M_j \le
k \mid n_0, n_1, \ldots, n_b\}$. We refrain from expanding this formula,
which would otherwise become too complicated. Then, with the
cumulative probability in (10), we can derive the probability
density function for the $j$th truncated register $B + M_j$.

$$\Pr\{B + M_j = k \mid n_0, n_1, \ldots, n_b\} = \begin{cases} 0 & \text{if } k < b \\ \Pr\{B + M_j \le k \mid n_0, n_1, \ldots, n_b\} & \text{if } k = b \\ \Pr\{B + M_j \le k \mid n_0, n_1, \ldots, n_b\} - \Pr\{B + M_j \le k - 1 \mid n_0, n_1, \ldots, n_b\} & \text{if } k > b \end{cases} \tag{12}$$

Here, the probability for the truncated register to take a value
less than $b$ is zero, because the base register $B$ increases to $b$
after receiving all the $n$ elements, which makes it impossible
for the truncated register $B + M_j$ to be smaller than $b$.
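The model in (10) to (12) can be sketched numerically as follows. Theorem 1 is not reproduced in this section, so the sketch assumes the usual HyperLogLog register distribution for the untruncated case, namely $\Pr\{\text{register} \le k \mid n\} = (1 - 2^{-k}/m)^n$; the function names and the list `ns` (holding $n_0, \ldots, n_b$) are illustrative.

```python
def cdf_fixed_base(k, i, n_i, m, K):
    """CDF of the truncated register while the base is fixed at i, cf. (11)."""
    if k >= i + K - 1:
        return 1.0
    # Assumed untruncated HLL CDF: every one of the n_i elements either
    # misses register j or has rho <= k.
    return (1.0 - 2.0 ** -k / m) ** n_i


def cdf_truncated(k, ns, m, K):
    """Equation (10): product over the phases i = 0..b."""
    prob = 1.0
    for i, n_i in enumerate(ns):
        prob *= cdf_fixed_base(k, i, n_i, m, K)
    return prob


def pmf_truncated(k, ns, m, K):
    """Equation (12): probability that B + M_j equals k."""
    b = len(ns) - 1
    if k < b:
        return 0.0
    if k == b:
        return cdf_truncated(k, ns, m, K)
    return cdf_truncated(k, ns, m, K) - cdf_truncated(k - 1, ns, m, K)
```

Under this assumption the probabilities over $k = b, \ldots, b + \mathcal{K} - 1$ telescope to one, which is a quick sanity check on the case analysis in (12).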


C. Maximum Likelihood Estimator 
As the probability for a truncated register $B + M_j$ to carry
an arbitrary value $k$ is available in (12), the only problem
that remains is how to use this parameterized probabilistic
model with the unknown variables $n_0, n_1, \ldots, n_b$ to generate
an unbiased estimation of the total stream cardinality $n$.

We address the problem by estimating the unknown
parameters one by one. When the base register $B$ is about
to increase from zero to one, we estimate $n_0$, the number
of distinct elements received so far. To accomplish this task, since
the base $B$ is still zero, we can directly use the maximum
likelihood estimator in Section IV. Note that when the
estimate of $n_0$ is smaller than $m$, we instead use the
result estimated by LinearCounting [19] for better accuracy, as
inspired by the work [16], which argues that LinearCounting is more
accurate than HyperLogLog if given enough memory space.
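A minimal sketch of this fallback is given below. It uses the standard LinearCounting formula $m \ln(m/V)$, where $V$ is the number of empty registers; the helper `estimate_n0` and its argument names are hypothetical and only illustrate the switch described above, with the MLE estimate passed in from elsewhere.

```python
import math


def linear_counting(m, zero_registers):
    """LinearCounting estimate from m registers, zero_registers of them empty."""
    if zero_registers == 0:
        raise ValueError("LinearCounting is undefined when no register is empty")
    return m * math.log(m / zero_registers)


def estimate_n0(mle_estimate, offsets):
    """Prefer LinearCounting while the MLE estimate of n_0 is below m."""
    m = len(offsets)
    zeros = sum(1 for v in offsets if v == 0)
    if mle_estimate < m and zeros > 0:
        return linear_counting(m, zeros)
    return mle_estimate
```

While $B$ is still zero, an offset register equal to zero is an empty register, which is why the zero count can be read directly from the offsets.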

Then, following the principle of mathematical induction, we
assume that the stream cardinalities $n_0, n_1, \ldots, n_{b-1}$ have all
been estimated as $\hat{n}_0, \hat{n}_1, \ldots, \hat{n}_{b-1}$ at the times that the base
register $B$ was about to update to $1, 2, \ldots, b$, respectively. Based
on them, we further estimate the next unknown variable
$n_b$. The likelihood function of $n_b$ is as follows.

$$\mathcal{L}(n_b \mid N_0, N_1, \ldots, N_{b+\mathcal{K}-1}) = \prod_{k=0}^{b+\mathcal{K}-1} \Pr\{B + M_j = k \mid n_0, n_1, \ldots, n_{b-1}, n_b\}^{N_k} \tag{13}$$

The probability density function $\Pr\{B + M_j = k \mid n_0, n_1, \ldots,
n_b\}$ of the truncated register $B + M_j$ is given in (12). We replace the true
values of $n_0, n_1, \ldots, n_{b-1}$ by their estimated values in (13).

By maximizing the log-likelihood function of $n_b$, we obtain
an optimized estimation of $n_b$.

$$\hat{n}_b = \arg\max_{n_b} \log \mathcal{L}(n_b \mid N_0, N_1, \ldots, N_{b+\mathcal{K}-1}) \tag{14}$$


To solve this maximum likelihood problem efficiently, we use
a steepest-ascent optimization method, which is described in
Appendix D of the extended version [14].
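The steepest-ascent solver is not reproduced here. For illustration only, the sketch below swaps in a plain geometric grid search over candidate values of $n_b$, which is slower but easy to check. It assumes that $N_k$ counts the registers whose truncated value equals $k$, and that the untruncated register follows the usual HyperLogLog distribution $(1 - 2^{-k}/m)^n$; all function and parameter names are hypothetical.

```python
import math


def cdf_truncated(k, ns, m, K):
    """Product form (10); each phase i uses the two-case CDF of (11)."""
    prob = 1.0
    for i, n_i in enumerate(ns):
        if k < i + K - 1:
            # Assumed untruncated HLL CDF for one register.
            prob *= (1.0 - 2.0 ** -k / m) ** n_i
    return prob


def pmf_truncated(k, ns, m, K):
    """Equation (12) for ns = [n_0, ..., n_b]."""
    b = len(ns) - 1
    if k < b:
        return 0.0
    lower = cdf_truncated(k - 1, ns, m, K) if k > b else 0.0
    return cdf_truncated(k, ns, m, K) - lower


def log_likelihood(n_b, prev, counts, m, K):
    """Log of (13); counts[k] is the number of registers with B + M_j = k."""
    ns = list(prev) + [n_b]
    total = 0.0
    for k, n_k in enumerate(counts):
        if n_k == 0:
            continue
        p = pmf_truncated(k, ns, m, K)
        if p <= 0.0:
            return -math.inf
        total += n_k * math.log(p)
    return total


def estimate_n_b(prev, counts, m, K, n_max=1e9, ratio=1.01):
    """Grid-search stand-in for (14): try n_b = 1, ratio, ratio^2, ..., n_max."""
    best_n, best_ll = 0.0, log_likelihood(0.0, prev, counts, m, K)
    n = 1.0
    while n <= n_max:
        ll = log_likelihood(n, prev, counts, m, K)
        if ll > best_ll:
            best_n, best_ll = n, ll
        n *= ratio
    return best_n
```

With `ratio=1.01` the grid resolves $n_b$ to about 1%, which is finer than the standard error of the estimator for the register counts considered in this paper.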

Because the stream cardinalities $n_0, n_1, \ldots, n_b$ have all
been estimated, we can obtain an estimation of the total stream
cardinality as $\hat{n} = \hat{n}_0 + \hat{n}_1 + \cdots + \hat{n}_b$. We call this algorithm
HLL-TailCut+ (abbreviated as HLL-TC+). Unlike HLL-TC,
this algorithm has no bias problem, as illustrated in Fig. 4(a).


D. Analysis of Memory Cost 

The memory cost of HLL-TC+ is the number of offset
registers $m$ multiplied by three bits, and its relative standard
error is roughly $1.0/\sqrt{m}$. We obtain this relative error by
applying HLL-TC+ to a fixed cardinality (e.g., ten million)
ten thousand times, and then calculating the relative standard
deviation of the estimated results. Later, in Table II, we
use more extensive experiments to verify that this relative error
equation also applies for other $m$ values and other $n$ values.

Since the standard error of HLL-TC+ is $1.0/\sqrt{m}$ and that of
HLL is $1.04/\sqrt{m}$, we can show that HLL-TC+ needs only 55% of the
memory of HLL to attain the same accuracy. Let $m_{HLL-TC+}$ (or $m_{HLL}$)
be the number of registers used by HLL-TC+ (or HLL). Then,
we need $m_{HLL-TC+} / m_{HLL} = (1.0/1.04)^2$ to attain the same accuracy. Since
the register size of HLL is five bits and that of HLL-TC+ is
only three bits, the memory cost of HLL-TC+ divided by that
of HLL is $\frac{3}{5} \cdot \left(\frac{1.0}{1.04}\right)^2 \approx 55\%$.
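The arithmetic behind the 55% figure can be checked with a few lines of Python. The function names are illustrative; the constants 1.0 and 1.04 and the register widths of three and five bits are the ones stated in the text.

```python
def registers_needed(target_error, constant):
    """Registers m such that constant / sqrt(m) equals target_error."""
    return (constant / target_error) ** 2


def memory_ratio(bits_tc_plus=3, bits_hll=5, c_tc_plus=1.0, c_hll=1.04):
    """Memory of HLL-TC+ divided by memory of HLL at equal standard error."""
    return (bits_tc_plus / bits_hll) * (c_tc_plus / c_hll) ** 2


# Memory in bytes for a 1.1% target error.
hll_bytes = 5 * registers_needed(0.011, 1.04) / 8
tc_plus_bytes = 3 * registers_needed(0.011, 1.0) / 8
```

For a 1.1% target error this gives roughly 5.6 kilobytes for HLL and about 3.1 kilobytes for HLL-TC+, in line with the figures quoted in the abstract.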

VII. Experiments 


In this section, we evaluate the performance of our proposed
HLL-TC and HLL-TC+ algorithms, and compare them with
state-of-the-art algorithms, including HyperLogLog (HLL) [7]
and HyperLogLog+ (HLL+) [9]. Note that we have shared
the source code of all four algorithms online [15].

Experiment Setup. For each cardinality estimator, we
evaluate two performance metrics: the average estimation bias
and the average estimation error when given the same amount of
memory. We evaluate the performance of the cardinality
estimators under three different scenarios. First, we assume a
very limited memory budget, no more than a few hundred
bytes per stream, to support cardinality measurements with
coarse accuracy ranging from 4% to 10%. Second, we
assume the available memory is several kilobytes per stream,
which enables highly accurate estimations with expected
errors lower than 2% or even 1%. Third, we
verify whether our HLL-TailCut+ estimator can support the
measurement of extra large streams whose cardinalities exceed
$4 \times 10^9$. This bound is important since a five-bit HLL register
can only count cardinalities up to $2^{2^5} = 2^{32} \approx 4 \times 10^9$.


Coarse-Accuracy Estimation. We consider the coarse
accuracy $\sigma \approx 4.4\%$. Then, the number of registers $m$ should be
$(1.0/0.044)^2 \approx 512$ for HLL-TC+, occupying $512 \times 3 = 1536 \approx 1.54$k
bits of memory. We give the same amount of memory to the other
three algorithms, and depict their performance in Fig. 5.

Fig. 5(a) shows that all four algorithms are approximately 
unbiased. Fig. 5(b) shows that the estimation error of HLL is 
slightly smaller than the error of HLL+. This is because HLL 
defines the register size to be five bits to support the counting 
of data on Giga scale, while HLL+ enlarges the register size 
to six bits, to extend the operating range to Tera or Peta 
scale. Hence, when given the same memory budget, HLL+ can 
allocate a smaller number of registers than HLL. Fig. 5(b) also 
shows that our HLL-TC and HLL-TC+ algorithms can provide 
smaller estimation error. This is because HLL-TC and HLL- 
TC+ have compressed the register size to four bits and three 
bits, respectively. Given the same amount of memory, they can 
allocate more registers to achieve higher accuracy. 
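To make the trade-off concrete, the sketch below counts how many registers each algorithm can afford under a fixed budget, using the register widths stated above. It is a simplification: the single base register of HLL-TC and HLL-TC+ and any other per-sketch overhead are ignored, and the names are illustrative.

```python
# Register width in bits for each algorithm, as stated in the text.
REGISTER_BITS = {"HLL": 5, "HLL+": 6, "HLL-TC": 4, "HLL-TC+": 3}


def registers_for_budget(budget_bits):
    """Number of whole registers each algorithm fits into budget_bits."""
    return {name: budget_bits // bits for name, bits in REGISTER_BITS.items()}
```

With the 1536-bit budget of this experiment, the counts are 307 for HLL, 256 for HLL+, 384 for HLL-TC, and 512 for HLL-TC+.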



Fig. 5. Compare cardinality estimators with the same 1.54k bits memory.
Panels: (a) estimation bias and (b) standard deviation, both versus
actual cardinality (×1000).

Fine-Accuracy Estimation. We consider the estimation error
$\sigma \approx 1.1\%$. To achieve such fine accuracy, HLL-TC+ needs
about $(1.0/0.011)^2 \approx 8192$ registers, which occupy 24.58k
bits of memory. We give the same amount of memory to the other
three algorithms, and evaluate their performance. Fig. 6(b)
shows that HLL-TC+ provides the best accuracy among the
four algorithms, and its expected error is $1.0/\sqrt{m} \approx 1.1\%$.



Fig. 6. Compare cardinality estimators with the same 24.58k bits memory.
Panels: (a) estimation bias and (b) standard deviation, both versus
actual cardinality (×1000).

Fig. 6(a) shows that HLL has a high spike that is strongly
biased. This is because, in the small region around $2.5m =
2.5 \cdot 8192 \cdot \frac{3}{5} \approx 12288$ (HLL fits only $8192 \cdot \frac{3}{5}$ five-bit
registers into the same memory), HLL makes a switch between
LinearCounting and its raw estimation equation in (2). This
bias problem has also been elaborated by previous work [9].
As shown in Fig. 6(a), this bias problem has been solved
by HLL+, HLL-TC and HLL-TC+, albeit using different
methods. HLL+ corrects the bias in a brute-force way [9]: it
empirically calculates the bias at 200 reference values
in the small region from $2m$ to $5m$, and then interpolates
between the 200 reference points to determine the correction
to apply for any given raw estimation value by (2). In contrast,
our HLL-TC addresses this problem elegantly, by substituting
the equation (2) with an MLE estimator in (5) within the small
region from $2m$ to $5m$. Our HLL-TC+ also does not have the
bias problem, because it uses the MLE estimator in (14).

Extra Large Measurement Range. The previous experiments
only show the evaluation results for cardinalities up to one
million. In the following, we verify that our HLL-TC+ can
measure extra large streams that have over four billion distinct
elements. Unlike HyperLogLog+, which increases the register
size to six bits to support such large streams, we only need an
array of three-bit offset registers (whose number is $m$) plus
a single base register that is at least six bits long.

We list in Table II the average estimation bias and error of
our HLL-TC+ algorithm when it is given different numbers
of registers $m$, such as $2^{10}$, $2^{12}$ and $2^{13}$. We only show the
experimental results for a single stream cardinality value, $16 \times
10^9$, since it takes days to process such a large data stream
ten thousand times. In this table, the second column lists
the average estimation bias of HLL-TC+, which is negligibly
small compared with its standard deviation shown in the
third column. This implies that our algorithm can estimate
extra large streams beyond the bound of four billion without bias.


TABLE II
Apply HLL-TC+ to data streams with $16 \times 10^9$ distinct elements.

Register Number (m)    Avg Bias    Std Deviation    Error Eqn
1024                   -0.02%      3.13%            $1.00/\sqrt{m}$
4096                   -0.06%      1.56%            $1.00/\sqrt{m}$
8192                   -0.01%      1.11%            $1.00/\sqrt{m}$


The last column of Table II rewrites the standard deviation
of the estimated results (shown in the third column) in the form
of a constant divided by $\sqrt{m}$. It shows that the standard
deviation of our algorithm can be accurately approximated by
$1.0/\sqrt{m}$. According to our previous analysis in Section VI-D,
if the expected relative error of HLL-TC+ is $1.0/\sqrt{m}$, then it
saves 45% of the memory cost of traditional HyperLogLog.
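The last column can be recomputed from the reported standard deviations with the short check below; the dictionary simply restates the values of Table II, and the function name is illustrative.

```python
import math

# Register number m -> reported standard deviation in percent (Table II).
TABLE_II = {1024: 3.13, 4096: 1.56, 8192: 1.11}


def implied_constant(m, std_percent):
    """Constant c such that std deviation = c / sqrt(m)."""
    return std_percent / 100.0 * math.sqrt(m)
```

All three rows give a constant within 0.01 of 1.00, matching the error equation listed in the table.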
VIII. Conclusion

This paper studies a fundamental problem called cardinality
estimation, in the domain of one-pass processing of streaming
data. We present a new solution named HLL-TailCut+, which
reduces memory consumption by 45% compared with the
state-of-the-art HyperLogLog. This improvement
originates from a technique we propose that truncates the
right-side long tail of the register distribution of HyperLogLog.
This technique brings two key benefits: it improves estimation
accuracy by rejecting outliers in the long tail, and it compresses
the register size by recording only the eight highest bars in the
histogram of HLL. Therefore, our algorithm can provide the
standard error $1.0/\sqrt{m}$ using only three bits of memory per register.
Moreover, HLL-TailCut+, being based on maximum likelihood
estimation, provides approximately unbiased estimations
over the entire range of cardinality, even at the point where it
switches to LinearCounting for handling small streams.